Claire Vandiedonck & Sandrine Caburet
Tutorial
In this tutorial, we will further analyse the Differential Expression between T1D patients and their related unaffected controls using microarrays transcriptome data ("Fil Rouge" UE5 and Practical Sessions in UE1).
We will first load the data and metadata.
We will also use two powerful packages:
ComplexHeatMap to generate enhanced heatmaps: with sample annotations on the sides, with hierarchical clustering dendograms generated with several distance metrics and clusterization aggregration methods
clusteProfiler to perform further enrichment analyses on sets of genes of interest
We will finally prepare input files for GSEA: Gene Set Enrichment Analysis with the GSEA Java stand-alone application.
Avant d'aller plus loin
# Code cell n°1
#setwd("~/meg_m1_gb_r")
myworking_path <- "./" # modify as necessary
datapath <- "/srv/data/meg-m1-gb/DGE_results/"
We install required libraries and load them.
# Code cell n°2
# list the required libraries
requiredLib <- c(
"statmod",
"RColorBrewer",
"dendextend",
"stats",
"grDevices",
"BiocManager",
"writexl",
"ggnewscale",
"ggupset",
"writexl",
"readxl",
"ggridges",
"gplots")
requiredBiocLib <- c("limma",
"org.Hs.eg.db",
"clusterProfiler",
"enrichplot",
"ComplexHeatmap",
"ReactomePA")
# install required libraries if not
for (lib in requiredLib) {
if (!require(lib, character.only = TRUE, quiet = TRUE)) {
install.packages(lib, quiet = TRUE)
}
}
for( lib in requiredBiocLib) {
if (!require(lib, character.only = TRUE, quiet = TRUE)) {
BiocManager::install(lib, quiet = TRUE)
}
}
# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
library(lib, character.only = TRUE)}
for (lib in requiredBiocLib) {
library(lib, character.only = TRUE)}
# kepp tracks of R and packages for this study
sessionInfo()
rm(lib, requiredLib, requiredBiocLib)
---------------------
Welcome to dendextend version 1.16.0
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags:
https://stackoverflow.com/questions/tagged/dendextend
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: ‘dendextend’
The following object is masked from ‘package:stats’:
cutree
Bioconductor version '3.14' is out-of-date; the current release version '3.16'
is available with R version '4.2'; see https://bioconductor.org/install
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
Attaching package: ‘BiocGenerics’
The following object is masked from ‘package:limma’:
plotMA
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:gplots’:
space
The following objects are masked from ‘package:base’:
expand.grid, I, unname
clusterProfiler v4.2.2 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
Attaching package: ‘clusterProfiler’
The following object is masked from ‘package:AnnotationDbi’:
select
The following object is masked from ‘package:IRanges’:
slice
The following object is masked from ‘package:S4Vectors’:
rename
The following object is masked from ‘package:stats’:
filter
========================================
ComplexHeatmap version 2.10.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
The new InteractiveComplexHeatmap package can directly export static
complex heatmaps into an interactive Shiny app with zero effort. Have a try!
This message can be suppressed by:
suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
ReactomePA v1.38.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use ReactomePA in published research, please cite:
Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
Loading required libraries
R version 4.1.3 (2022-03-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /srv/conda/envs/notebook/lib/libopenblasp-r0.3.21.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] ReactomePA_1.38.0 ComplexHeatmap_2.10.0 enrichplot_1.14.2 [4] clusterProfiler_4.2.2 org.Hs.eg.db_3.14.0 AnnotationDbi_1.56.2 [7] IRanges_2.28.0 S4Vectors_0.32.4 Biobase_2.54.0 [10] BiocGenerics_0.40.0 limma_3.50.3 gplots_3.1.3 [13] ggridges_0.5.4 readxl_1.4.1 ggupset_0.3.0 [16] ggnewscale_0.4.8 writexl_1.4.1 BiocManager_1.30.19 [19] dendextend_1.16.0 RColorBrewer_1.1-3 statmod_1.4.37 loaded via a namespace (and not attached): [1] backports_1.4.1 uuid_1.1-0 shadowtext_0.1.2 [4] circlize_0.4.15 fastmatch_1.1-3 plyr_1.8.8 [7] igraph_1.3.5 repr_1.1.4 lazyeval_0.2.2 [10] splines_4.1.3 BiocParallel_1.28.3 GenomeInfoDb_1.30.1 [13] ggplot2_3.4.0 digest_0.6.30 foreach_1.5.2 [16] yulab.utils_0.0.5 htmltools_0.5.3 GOSemSim_2.20.0 [19] viridis_0.6.2 GO.db_3.14.0 fansi_1.0.3 [22] checkmate_2.1.0 magrittr_2.0.3 memoise_2.0.1 [25] cluster_2.1.4 doParallel_1.0.17 Biostrings_2.62.0 [28] graphlayouts_0.8.3 matrixStats_0.62.0 colorspace_2.0-3 [31] rappdirs_0.3.3 blob_1.2.3 ggrepel_0.9.2 [34] dplyr_1.0.10 crayon_1.5.2 RCurl_1.98-1.9 [37] jsonlite_1.8.3 graph_1.72.0 scatterpie_0.1.8 [40] iterators_1.0.14 ape_5.6-2 glue_1.6.2 [43] polyclip_1.10-4 gtable_0.3.1 zlibbioc_1.40.0 [46] XVector_0.34.0 GetoptLong_1.0.5 graphite_1.40.0 [49] shape_1.4.6 scales_1.2.1 DOSE_3.20.1 [52] DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1 [55] clue_0.3-62 gridGraphics_0.5-1 tidytree_0.4.1 [58] reactome.db_1.77.0 bit_4.0.4 httr_1.4.4 [61] fgsea_1.20.0 pkgconfig_2.0.3 farver_2.1.1 [64] utf8_1.2.2 ggplotify_0.1.0 tidyselect_1.2.0 [67] rlang_1.0.6 reshape2_1.4.4 munsell_0.5.0 [70] cellranger_1.1.0 tools_4.1.3 cachem_1.0.6 [73] downloader_0.4 cli_3.4.1 generics_0.1.3 [76] RSQLite_2.2.18 evaluate_0.18 stringr_1.4.1 [79] fastmap_1.1.0 ggtree_3.7.1.001 bit64_4.0.5 [82] tidygraph_1.2.2 caTools_1.18.2 purrr_0.3.5 [85] KEGGREST_1.34.0 ggraph_2.1.0 nlme_3.1-160 [88] aplot_0.1.8 DO.db_2.9 compiler_4.1.3 [91] png_0.1-7 treeio_1.23.0 tibble_3.1.8 [94] tweenr_2.0.2 stringi_1.7.8 lattice_0.20-45 [97] IRdisplay_1.1 Matrix_1.5-3 vctrs_0.5.0 [100] pillar_1.8.1 lifecycle_1.0.3 GlobalOptions_0.1.2 [103] data.table_1.14.4 bitops_1.0-7 patchwork_1.1.2 [106] qvalue_2.26.0 R6_2.5.1 KernSmooth_2.23-20 [109] gridExtra_2.3 codetools_0.2-18 MASS_7.3-58.1 [112] gtools_3.9.3 assertthat_0.2.1 rjson_0.2.21 [115] withr_2.5.0 GenomeInfoDbData_1.2.7 parallel_4.1.3 [118] ggfun_0.0.8 IRkernel_1.2 tidyr_1.2.1 [121] pbdZMQ_0.3-8 ggforce_0.4.1 base64enc_0.1-3
At the end of the Fil Rouge on microarrays, we ended up with several R objects that we will reload in this R Session:
# Code cell n°3
load(paste(datapath,"probes.RData",sep = ""))
ls()
str(probes)
gene_list <- probes$TargetID
cat("There are ", length(unique(gene_list)), "unique genes\n")
'data.frame': 47323 obs. of 9 variables: $ ProbeID : int 6450255 2570615 6370619 2600039 2650615 5340672 2000519 3870044 7050209 1580181 ... $ CHROMOSOME : chr "7" "19" "19" "10" ... $ CYTOBAND : chr "7p15.3e" "19q13.43c" "19q13.43c" "10q11.23c" ... $ PROBE_CHR_ORIENTATION: chr "-" "-" "-" "-" ... $ PROBE_COORDINATES : chr "20147187-20147236" "63548541-63548590" "63549180-63549229" "52566586-52566635" ... $ REFSEQ_ID : chr "NM_182762.2" "NM_130786.2" "NM_130786.2" "NM_138932.1" ... $ ENTREZ_GENE_ID : int 346389 1 1 29974 29974 29974 23784 23784 23784 54715 ... $ TargetID : chr "7A5" "A1BG" "A1BG" "A1CF" ... $ SYMBOL : chr "7A5" "A1BG" "A1BG" "A1CF" ... There are 34694 unique genes
# Code cell n°4
load(paste(datapath,"samples_info.RData", sep = ""))
ls()
str(samples_info)
'data.frame': 264 obs. of 8 variables: $ array.labels: chr "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ... $ PedID : chr "1" "1" "1" "1" ... $ ID : chr "1" "1" "1" "1" ... $ Status : chr "2" "2" "2" "2" ... $ Stim : chr "0" "0" "6" "6" ... $ Full : chr "1_1_2_0" "1_1_2_0" "1_1_2_6" "1_1_2_6" ... $ Sex : chr "1" "1" "1" "1" ... $ Age : int 10 10 10 10 10 10 18 18 18 18 ...
We load the norm.quant object obtained after log2 transformation and between-samples quantile normalization.
# Code cell n°5
load(paste(datapath, "norm.quant.RData", sep = ""))
ls()
str(norm.quant)
norm.quant[1:10, 1:10]
num [1:47323, 1:264] 7.35 7.09 6.3 6.57 6.49 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:47323] "6450255" "2570615" "6370619" "2600039" ... ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
| 5753669129_B | 5753669129_C | 5753669129_E | 5753669129_K | 5753669129_A | 5753669129_F | 5753669095_B | 5753669095_F | 5753669095_G | 5753669095_K | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6450255 | 7.351549 | 7.056063 | 7.052947 | 7.398073 | 7.223727 | 6.866752 | 7.019286 | 6.986984 | 7.289715 | 7.082987 |
| 2570615 | 7.087770 | 6.909035 | 6.874572 | 6.794254 | 6.635361 | 6.766837 | 6.650753 | 6.841188 | 6.816850 | 6.684924 |
| 6370619 | 6.299524 | 6.754202 | 6.550678 | 6.574270 | 6.625025 | 6.473627 | 6.401915 | 6.318659 | 6.461961 | 6.427141 |
| 2600039 | 6.572519 | 6.609202 | 6.568940 | 6.391801 | 6.435965 | 6.782794 | 6.468696 | 6.537351 | 6.494413 | 6.615691 |
| 2650615 | 6.492314 | 6.494968 | 6.773884 | 6.507731 | 6.935357 | 6.428773 | 6.606865 | 6.618028 | 6.434543 | 6.448600 |
| 5340672 | 6.194034 | 6.138695 | 6.069135 | 6.192778 | 6.247462 | 6.235562 | 5.909282 | 6.065657 | 6.056266 | 6.199995 |
| 2000519 | 6.717292 | 6.453334 | 6.883398 | 6.958528 | 6.913643 | 6.357944 | 6.649513 | 6.774690 | 6.485854 | 6.731937 |
| 3870044 | 5.894237 | 5.753032 | 5.981951 | 5.991817 | 5.744194 | 5.728102 | 5.842047 | 5.699204 | 5.672224 | 5.871414 |
| 7050209 | 6.733664 | 6.699775 | 6.569374 | 6.527873 | 6.850950 | 6.769487 | 6.569142 | 6.539663 | 6.786768 | 6.549867 |
| 1580181 | 6.740298 | 6.402660 | 6.329971 | 6.402146 | 6.536189 | 6.628187 | 6.624250 | 6.416259 | 6.415431 | 6.628095 |
Each row corresponds to an Illumina probeID (probe Sets), while each column name corresponds to an Illumina array sample.
The correspondance between probeID and gene names is provided in the probes object, while the experimental design matrix of samples is provided in the samples_info object.
At the end of the DGE analysis, we stored all results in a list limma.full.outs with 6 dataframes:
1. with the result of the full model analysis: testing whether gene expression varies in at least one condition
2. with the result of the first contrast for the DGE analysis
3. etc...
We saved this list in an excel document DGE_T1D_microarrays.xlsx, each tab corresponding to one list element (i.e. one tab corresponding to one analysis).
Read Excel function.# Code cell n°6
limma.full.outs <- list()
sheet_names <- readxl::excel_sheets(paste(datapath, "DGE_T1D_microarrays.xlsx", sep = ""))
for (sheet in sheet_names) {
limma.full.outs[[sheet]] <- as.data.frame(read_xlsx(paste(datapath, "DGE_T1D_microarrays.xlsx", sep = ""), sheet = sheet))
}
This list contains the following 6 dataframes:
# Code cell n°7
print(names(limma.full.outs))
[1] "limma.fullmodel.out" "limma.Pat.H0.out" "limma.Pat.H24.out" [4] "limma.Pat.H6.out" "limma.Pat.H624.out" "limma.Pat.out"
# Code cell n°8
str(limma.full.outs[[1]])
'data.frame': 47323 obs. of 18 variables: $ ProbeID : chr "3120474" "3310376" "630725" "1430709" ... $ Pat.H0 : num -0.58 0.327 -0.569 -0.285 0.328 ... $ Pat.H6 : num 0.1486 0.0132 -0.2268 -0.0714 0.0257 ... $ Pat.H24 : num 0.00847 -0.07835 -0.1993 -0.35116 0.16611 ... $ Pat.H6H24 : num 0.1571 -0.0652 -0.4261 -0.4226 0.1918 ... $ Patient : num -1.583 0.915 -2.133 -1.277 1.175 ... $ AveExpr : num 7.81 11.45 7.92 6.97 7.27 ... $ F : num 21.6 17 14.2 14 13.9 ... $ P.Value : num 1.55e-12 3.86e-10 1.29e-08 1.64e-08 1.83e-08 ... $ adj.P.Val : num 7.34e-08 9.14e-06 1.60e-04 1.60e-04 1.60e-04 ... $ CHROMOSOME : chr "22" "11" "12" "17" ... $ CYTOBAND : chr "22q11.23c" "11q12.1a" "12q15a" "17q11.2e" ... $ PROBE_CHR_ORIENTATION: chr "+" "-" "-" "+" ... $ PROBE_COORDINATES : chr "23923092-23923141" "57296016-57296065" "68548736-68548785" "28348895-28348909:28348910-28348944" ... $ REFSEQ_ID : chr "XM_371461.4" "NM_012456.2" "NM_000619.2" "NM_173847.3" ... $ ENTREZ_GENE_ID : num 85379 26519 3458 124912 638 ... $ TargetID : chr "KIAA1671" "TIMM10" "IFNG" "SPACA3" ... $ SYMBOL : chr "KIAA1671" "TIMM10" "IFNG" "SPACA3" ...
We have the log2 mean expression per condition, followed by the average expression in all conditions, then the statistic value F and the corresponding P.Val and adj.P.Val after Benjamini-Hochberg correction. Then you have probes annotations.
# Code cell n°9
str(limma.full.outs[[6]])
'data.frame': 47323 obs. of 17 variables: $ ProbeID : chr "3120474" "3310376" "630725" "6280440" ... $ logFC : num -1.583 0.915 -2.133 1.175 0.665 ... $ CI.L : num -1.979 0.657 -2.797 0.799 0.449 ... $ CI.R : num -1.186 1.174 -1.47 1.551 0.882 ... $ AveExpr : num 7.81 11.45 7.92 7.27 7.86 ... $ t : num -7.86 6.97 -6.33 6.15 6.04 ... $ P.Value : num 9.51e-14 2.48e-11 1.02e-09 2.86e-09 5.02e-09 ... $ adj.P.Val : num 4.50e-09 5.86e-07 1.62e-05 3.39e-05 4.75e-05 ... $ B : num 20 14.9 11.5 10.6 10.1 ... $ CHROMOSOME : chr "22" "11" "12" "22" ... $ CYTOBAND : chr "22q11.23c" "11q12.1a" "12q15a" "22q13.2c" ... $ PROBE_CHR_ORIENTATION: chr "+" "-" "-" "+" ... $ PROBE_COORDINATES : chr "23923092-23923141" "57296016-57296065" "68548736-68548785" "41855291-41855340" ... $ REFSEQ_ID : chr "XM_371461.4" "NM_012456.2" "NM_000619.2" "NM_001197.3" ... $ ENTREZ_GENE_ID : num 85379 26519 3458 638 282969 ... $ TargetID : chr "KIAA1671" "TIMM10" "IFNG" "BIK" ... $ SYMBOL : chr "KIAA1671" "TIMM10" "IFNG" "BIK" ...
We have the log2 Fold Change between the two conditions (here log2(Pat) - log2(Cont)), with its confidence interval, followed by the average expression in both conditions, then the statistic value t, the P.Value and adj.P.Val after Benjamini-Hochberg correction and the B statiscics (the one used for pvalue). Then you have probe annotations.
ComplexHeatmap¶There are two main packages to draw enhanced heatmaps:
ComplexHeatmap is the most flexible tool to draw heatmaps with a lot of options, notably to add annotations.
Below is presented a quick example on ComplexHeatmap usage.
# Code cell n°10
probes_sig <- subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )[, "ProbeID"]
length(probes_sig)
mat_full <- norm.quant[row.names(norm.quant) %in% probes_sig,]
str(mat_full)
colnames(mat_full) <- samples_info$Status
row.names(mat_full) <- probes$TargetID[probes$ProbeID %in% probes_sig]
num [1:34, 1:264] 7.84 7.41 7.54 10.67 9.16 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:34] "1820347" "6280440" "2510546" "6650079" ... ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
coolmap() function could draw a simple heatmap : # Code cell n°11
options(repr.plot.width = 10, repr.plot.height = 10)
limma::coolmap(mat_full, col = "redblue",cexCol = 0.6)
To use ComplexHeatmap, the dataset must be a matrix with samples in columns and genes in rows. We thus transpose our matrix mat_full.
In addition, we use the function scale() to center our data with a Z score:
# Code cell n°12
tmat_full <- t(apply(mat_full, 1, scale))
dist(). The correlation distance is defined as 1 - cor(x, y, method). See there for further details.# Code cell n°13
ComplexHeatmap::Heatmap(tmat_full,
name = "Z-score",
clustering_distance_rows = "pearson",
column_title = "pre-defined distance method (1 - pearson)")
clustering_method_rows and clustering_method_columns. Possible methods are those supported in hclust() function: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).# Code cell n°14
ComplexHeatmap::Heatmap(tmat_full,
name = "Z-score",
clustering_distance_rows = "pearson",
column_title = "pre-defined distance method (1 - pearson)",
clustering_method_rows = "ward.D")
One nice usage of ComplexHeatmap is the possibility to add custom annotations: ha for "heatmap annotation" is the object name used in ComplexHeatmap tutorial.
anno_points() or anno_barplot() or qualitative with a dataframe. The col argument is used to specify the colors of the different categorical values. For the quantitative values, you can add points or boxplots for example.# Code cell n°15
ha <- HeatmapAnnotation(age = anno_points(samples_info$Age),
df = data.frame(Sex = samples_info$Sex,
Stim = samples_info$Stim,
Status = samples_info$Status),
col = list(Status = c("1" = "blue", "2" = "red"),
Stim = c("0" = "lightgreen", "6"= "green", "24" = "darkgreen"),
Sex = c("1" = "lightblue", "2" = "pink"))
)
ComplexHeatmap::Heatmap(tmat_full,
name = "Z-score",
top_annotation = ha,
row_names_gp = gpar(fontsize = 10)
)
Below is another example without clustering the samples, saving the plot in an R object then drawing the heatmap with the draw() function.
# Code cell n°16
h_tx <- ComplexHeatmap::Heatmap(tmat_full,
name = "Z-score",
top_annotation = ha,
row_names_gp = gpar(fontsize = 10),
cluster_columns = FALSE)
ComplexHeatmap::draw(h_tx)
You could also add annotations to genes: adding gene level expression for example (see next example), or coloring genes given gene sets.
# Code cell n°17
ha_col <- columnAnnotation(age = anno_points(samples_info$Age),
df = data.frame(Sex = samples_info$Sex,
Stim = samples_info$Stim,
Status = samples_info$Status),
col = list(Status = c("1" = "blue", "2" = "red"),
Stim = c("0" = "lightgreen", "6"= "green", "24" = "darkgreen"),
Sex = c("1" = "lightblue", "2" = "pink"))
)
ha_row <- rowAnnotation(mean_expr = anno_points(subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )$AveExpr)
)
# Code cell n°18
h_tmat_full <- ComplexHeatmap::Heatmap(tmat_full,
name = "Z-score",
top_annotation = ha_col,
right_annotation = ha_row,
row_names_gp = gpar(fontsize = 10),
cluster_columns = FALSE)
ComplexHeatmap::draw(h_tmat_full)
=> There are many more possibilities with ComplexHeatmap, for example to play with color gradient, to display dendograms with different colors, to split the heatmap with the clusters, to draw heatmaps next to each other in a composite plot... Just refer to its guide!
Note: You could also try to play with two other packages to draw interactive heatmaps:
InteractiveComplexHeatmapheatmaply as explained in this short tutorial.clusterProfiler¶There are two main packages to study gene sets enrichment in R.
gprofiler2: a CRAN packages to do similar analyses as in the web interface of G:Profiler
clusterProfiler available in Bioconductor.
clusterProfiler is by far the most complete and flexible tool to perform ORAs (Over-Representation Analyses) that are single-gene enrichment methods, or to perform SAFE (Significance Analysis of Function and Expression) like GSEA.
Its guide is available at both these links:
clusterProfiler can be used for several ORAs on different datasets: Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb) and custom datasets.
We will illustrate clusterProfiler usage on GO, Reactome and KEGG.
clusterProfiler needs a set of unique genes.
# Code cell n°19
mygenes <- unique(row.names(mat_full))
The function to run it for GO ebrichment is enrichGO(). For the ont (Ontology) argument, we can specify either molecular function ("MF"), cellular components ("CC"), biological processes ("BP") or "ALL".
Here we use org.Hs.eg.db as the database collecting all genes for the human genome with their different identifiers. This database is available in a bioconductor package: https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
# Code cell n°20
enr_go <- enrichGO(gene = mygenes,
ont ="ALL",
OrgDb = org.Hs.eg.db,
universe = unique(probes$TargetID),
keyType = "SYMBOL",
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.1,
pAdjustMethod = "BH")
Be patient, the above command can last a minute or so
We can visualize for example to top 20 enrichments:
# Code cell n°21
head(enr_go, 20)
| ONTOLOGY | ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <chr> | <int> | |
| GO:0060333 | BP | GO:0060333 | interferon-gamma-mediated signaling pathway | 3/23 | 27/15612 | 7.983325e-06 | 0.008063158 | 0.006168169 | IFNG/IRF1/STAT1 | 3 |
| GO:0034341 | BP | GO:0034341 | response to interferon-gamma | 4/23 | 138/15612 | 4.541988e-05 | 0.022937040 | 0.017546417 | IFNG/IRF1/STAT1/UBD | 4 |
| GO:0051770 | BP | GO:0051770 | positive regulation of nitric-oxide synthase biosynthetic process | 2/23 | 16/15612 | 2.460307e-04 | 0.054114341 | 0.041396484 | IFNG/STAT1 | 2 |
| GO:0051767 | BP | GO:0051767 | nitric-oxide synthase biosynthetic process | 2/23 | 20/15612 | 3.881544e-04 | 0.054114341 | 0.041396484 | IFNG/STAT1 | 2 |
| GO:0051769 | BP | GO:0051769 | regulation of nitric-oxide synthase biosynthetic process | 2/23 | 20/15612 | 3.881544e-04 | 0.054114341 | 0.041396484 | IFNG/STAT1 | 2 |
| GO:0051607 | BP | GO:0051607 | defense response to virus | 4/23 | 242/15612 | 3.954710e-04 | 0.054114341 | 0.041396484 | IFNG/IRF1/RNASE6/STAT1 | 4 |
| GO:0140546 | BP | GO:0140546 | defense response to symbiont | 4/23 | 242/15612 | 3.954710e-04 | 0.054114341 | 0.041396484 | IFNG/IRF1/RNASE6/STAT1 | 4 |
| GO:0035458 | BP | GO:0035458 | cellular response to interferon-beta | 2/23 | 21/15612 | 4.286284e-04 | 0.054114341 | 0.041396484 | IRF1/STAT1 | 2 |
| GO:0071346 | BP | GO:0071346 | cellular response to interferon-gamma | 3/23 | 116/15612 | 6.350974e-04 | 0.056975513 | 0.043585228 | IFNG/IRF1/STAT1 | 3 |
| GO:0030097 | BP | GO:0030097 | hemopoiesis | 6/23 | 762/15612 | 6.547515e-04 | 0.056975513 | 0.043585228 | CD79B/IFNG/IRF1/SH3PXD2A/STAT1/UBD | 6 |
| GO:0045589 | BP | GO:0045589 | regulation of regulatory T cell differentiation | 2/23 | 26/15612 | 6.603882e-04 | 0.056975513 | 0.043585228 | IFNG/IRF1 | 2 |
| GO:0035456 | BP | GO:0035456 | response to interferon-beta | 2/23 | 28/15612 | 7.667074e-04 | 0.056975513 | 0.043585228 | IRF1/STAT1 | 2 |
| GO:0002521 | BP | GO:0002521 | leukocyte differentiation | 5/23 | 518/15612 | 8.082284e-04 | 0.056975513 | 0.043585228 | CD79B/IFNG/IRF1/SH3PXD2A/UBD | 5 |
| GO:0045066 | BP | GO:0045066 | regulatory T cell differentiation | 2/23 | 29/15612 | 8.227632e-04 | 0.056975513 | 0.043585228 | IFNG/IRF1 | 2 |
| GO:0048534 | BP | GO:0048534 | hematopoietic or lymphoid organ development | 6/23 | 800/15612 | 8.461710e-04 | 0.056975513 | 0.043585228 | CD79B/IFNG/IRF1/SH3PXD2A/STAT1/UBD | 6 |
| GO:0009615 | BP | GO:0009615 | response to virus | 4/23 | 338/15612 | 1.379932e-03 | 0.082882012 | 0.063403228 | IFNG/IRF1/RNASE6/STAT1 | 4 |
| GO:0030099 | BP | GO:0030099 | myeloid cell differentiation | 4/23 | 339/15612 | 1.395044e-03 | 0.082882012 | 0.063403228 | IFNG/SH3PXD2A/STAT1/UBD | 4 |
| GO:0032735 | BP | GO:0032735 | positive regulation of interleukin-12 production | 2/23 | 40/15612 | 1.565186e-03 | 0.087824308 | 0.067183994 | IFNG/IRF1 | 2 |
| GO:0006775 | BP | GO:0006775 | fat-soluble vitamin metabolic process | 2/23 | 43/15612 | 1.807145e-03 | 0.096064013 | 0.073487218 | CYP4F3/IFNG | 2 |
| GO:0019815 | CC | GO:0019815 | B cell receptor complex | 1/23 | 4/16055 | 5.718533e-03 | 0.099868919 | 0.084100143 | CD79B | 1 |
Several nice plots can be drawn as described in Chapter 15 of the manual: http://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html.
Among available plots, we can draw:
# Code cell n°22
options(repr.plot.width = 12, repr.plot.height = 7)
barplot(enr_go, showCategory = 20,
label_format = function(x) stringr::str_wrap(x, width = 120)) + #to be able to see terms description in a single row : play on the number (eg. 120)
ggtitle("dotplot for ORA") ## uses ggplot2 you will see in session 4
# Code cell n°23
options(repr.plot.width = 12, repr.plot.height = 7)
dotplot(enr_go,
showCategory=20,
label_format = function(x) stringr::str_wrap(x, width= 120)) +
ggtitle("dotplot for ORA")
# Code cell n°24
options(repr.plot.width = 30, repr.plot.height= 10)
enr_go <- enrichplot::pairwise_termsim(enr_go, method = "JC", semData = "org.Hs.eg.db")
emapplot(enr_go, showCategory = 20)
Do not hesitate to rerun the above command if the figure displayed is not readable. The layout changes every time it is run.
# Code cell n°25
options(repr.plot.width = 30, repr.plot.height = 15)
treeplot(enr_go,
hclust_method = "average")
# Code cell n°26
upsetplot(enr_go)
One plot often generated like the upsetplot is the cnetplot that depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.
For some reason, it fails in this version of clusterProfiler installed on the environment and we cannot update it because the last version uses a newer R version. But below is the command to use that should work on your computer and a picture of what you should obtain.
cnetplot(enr_go, categorySize="pvalue", showCategory = 5)
There are many more plots available. The package clusterProfiler is in constant development with new plots.
Similarly, we can run the ORA on Reactome pathways. It uses the gene IDs rather than the gene symbols. So we have to get them.
# Code cell n°27
geneID_sig <- subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )[, "ENTREZ_GENE_ID"]
length(geneID_sig)
# Code cell n°28
enr_react <- enrichPathway(gene = geneID_sig, pvalueCutoff = 0.10, readable = TRUE)
head(enr_react)
| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <chr> | <int> | |
| R-HSA-877312 | R-HSA-877312 | Regulation of IFNG signaling | 2/20 | 14/10867 | 0.0002889939 | 0.03236732 | 0.02159849 | IFNG/STAT1 | 2 |
| R-HSA-877300 | R-HSA-877300 | Interferon gamma signaling | 3/20 | 91/10867 | 0.0005841200 | 0.03271072 | 0.02182764 | IFNG/IRF1/STAT1 | 3 |
You can draw any figures as above.
# Code cell n°29
options(repr.plot.width = 12, repr.plot.height = 7)
dotplot(enr_react,
showCategory = 20,
label_format = function(x) stringr::str_wrap(x, width= 120)) +
ggtitle("dotplot for ORA")
Similarly, we can run the ORA on KEGG pathways. The trick here is not to use gene name symbols, nor Gene IDs but their matching Uniprot IDs. The function bitr() generates the correspondance when available.
# Code cell n°30
mygenes2 <- clusterProfiler::bitr(mygenes, fromType = 'SYMBOL', toType = 'UNIPROT', OrgDb = 'org.Hs.eg.db')
mygenes2
'select()' returned 1:many mapping between keys and columns Warning message in clusterProfiler::bitr(mygenes, fromType = "SYMBOL", toType = "UNIPROT", : “29.41% of input gene IDs are fail to map...”
| SYMBOL | UNIPROT | |
|---|---|---|
| <chr> | <chr> | |
| 1 | ASB1 | Q9Y576 |
| 2 | ASB1 | B9A047 |
| 3 | BIK | A0A024R4X6 |
| 4 | BIK | Q13323 |
| 7 | CASP6 | P55212 |
| 8 | CD79B | P40259 |
| 9 | CYP4F3 | A0A024R7J8 |
| 10 | CYP4F3 | Q08477 |
| 11 | CYP4F3 | A0A024R7I2 |
| 12 | DPP7 | Q9UHL4 |
| 13 | EXOC4 | Q6NX51 |
| 14 | EXOC4 | Q96A65 |
| 17 | FBXL6 | Q8N531 |
| 18 | GLYATL2 | A0A024R4Z5 |
| 19 | GLYATL2 | Q8WU03 |
| 21 | IFNG | P01579 |
| 22 | IRF1 | P10914 |
| 23 | IRF1 | Q6FHN8 |
| 24 | KIAA1671 | Q9BY89 |
| 29 | MSH3 | P20585 |
| 30 | NME6 | O75414 |
| 31 | NME6 | C9JQB1 |
| 32 | NME6 | A0A0C4DG91 |
| 33 | PCBP4 | A0A024R320 |
| 34 | PCBP4 | P57723 |
| 35 | PCBP4 | C9J0A4 |
| 36 | RNASE6 | Q6IB39 |
| 37 | RNASE6 | Q93091 |
| 38 | SH3PXD2A | B3KPL1 |
| 39 | SH3PXD2A | Q5TCZ1 |
| 40 | SPACA3 | Q05C28 |
| 41 | SPACA3 | E9PF91 |
| 42 | SPACA3 | Q8IXA5 |
| 43 | SPACA3 | A0A080YUZ7 |
| 44 | ST3GAL6 | A0A087WXB8 |
| 45 | ST3GAL6 | Q9Y274 |
| 46 | STARD13 | Q9Y3M8 |
| 47 | STARD13 | B3KRK6 |
| 48 | STARD13 | B2R789 |
| 49 | STARD13 | A0A024RDV4 |
| 50 | STAT1 | P42224 |
| 51 | TIMM10 | P62072 |
| 52 | TSPAN12 | A0A024R740 |
| 53 | TSPAN12 | O95859 |
| 54 | UBD | A0A1U9X8S6 |
| 55 | UBD | O15205 |
As you can see, several UniprotIDs may exist for a given gene.
# Code cell n°31
table(mygenes2$SYMBOL)
ASB1 BIK CASP6 CD79B CYP4F3 DPP7 EXOC4 FBXL6
2 2 1 1 3 1 2 1
GLYATL2 IFNG IRF1 KIAA1671 MSH3 NME6 PCBP4 RNASE6
2 1 2 1 1 3 3 2
SH3PXD2A SPACA3 ST3GAL6 STARD13 STAT1 TIMM10 TSPAN12 UBD
2 4 2 4 1 1 2 2
Thus, duplicated IDs must be removed if any UNIPROT IDs taken.
# Code cell n°32
mygenes2 <- mygenes2[!duplicated(mygenes2$SYMBOL), ]
mygenes2
mygeneList2 <-mygenes2$UNIPROT
mygeneList2
| SYMBOL | UNIPROT | |
|---|---|---|
| <chr> | <chr> | |
| 1 | ASB1 | Q9Y576 |
| 3 | BIK | A0A024R4X6 |
| 7 | CASP6 | P55212 |
| 8 | CD79B | P40259 |
| 9 | CYP4F3 | A0A024R7J8 |
| 12 | DPP7 | Q9UHL4 |
| 13 | EXOC4 | Q6NX51 |
| 17 | FBXL6 | Q8N531 |
| 18 | GLYATL2 | A0A024R4Z5 |
| 21 | IFNG | P01579 |
| 22 | IRF1 | P10914 |
| 24 | KIAA1671 | Q9BY89 |
| 29 | MSH3 | P20585 |
| 30 | NME6 | O75414 |
| 33 | PCBP4 | A0A024R320 |
| 36 | RNASE6 | Q6IB39 |
| 38 | SH3PXD2A | B3KPL1 |
| 40 | SPACA3 | Q05C28 |
| 44 | ST3GAL6 | A0A087WXB8 |
| 46 | STARD13 | Q9Y3M8 |
| 50 | STAT1 | P42224 |
| 51 | TIMM10 | P62072 |
| 52 | TSPAN12 | A0A024R740 |
| 54 | UBD | A0A1U9X8S6 |
You do the same gene SYMBOL to UNIPROT conversion for the universe:
# Code cell n°33
prot_universe <- unique(probes$TargetID)
prot_universe <- clusterProfiler::bitr(prot_universe, fromType = 'SYMBOL', toType = 'UNIPROT', OrgDb = 'org.Hs.eg.db')
prot_universe <- prot_universe[!duplicated(prot_universe$SYMBOL), ]
prot_universe <- prot_universe$UNIPROT
'select()' returned 1:many mapping between keys and columns Warning message in clusterProfiler::bitr(prot_universe, fromType = "SYMBOL", toType = "UNIPROT", : “53.77% of input gene IDs are fail to map...”
You are now ready for the enrichment analysis on KEGG.
Unfortunately, the installed version of clusterProfiler in adenine is no longer able to run it and we cannot update it to a newer version which is based on R 4.2.0. But we put the commands below and an image of the expected results. On you personnal computer, it whould work.
enr_kegg <- enrichKEGG(gene = mygeneList2,
keyType = "uniprot",
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.10,
universe = prot_universe,
organism = "hsa",
pAdjustMethod = "none")
dotplot(enr_kegg)
enr_kegg <- pairwise_termsim(enr_kegg, method = "JC", semData = "org.Hs.eg.db")
emapplot(enr_kegg)
enr_kegg2 <- setReadable(enr_kegg, OrgDb = "org.Hs.eg.db", "UNIPROT")
cnetplot(enr_kegg2, categorySize="pvalue", showCategory = 5)
GSEA, standing for Gene Set Enrichment Analysis, is a Significance Analysis of Function and Expression (SAFE) method. While ORAs (over-representation methods) are single-gene approaches, GSEA uses all genes in the dataset and their potential correlation. In addition, it uses quantitative metrics like gene expression or statistics results from a DGE analysis. This later is the most recommended approach to follow.
GSEA was developped at the Broad Institue. The documentation is available here. It was described in Subramanian et al. 2005
Finally we can run a Gene Set Enrichment Aanalysis (GSEA) directly with clusterProfiler. We need first to sort genes by a quantitative metrics, here the statistics of the DGE analysis.
We do it here for example with the Pat versus Control DGE results on GO terms.
# Code cell n°34
mygenes <- limma.full.outs[[6]]$B
names(mygenes) <- limma.full.outs[[6]]$TargetID
mygenes <- sort(mygenes, decreasing = T)
head(mygenes)
# Code cell n°35
gsea_go <- gseGO(geneList = mygenes,
ont ="ALL",
keyType = "SYMBOL",
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.25,
verbose = TRUE,
OrgDb = org.Hs.eg.db,
pAdjustMethod = "none",
nPerm = 100) # by default there are 1000 permutations
preparing geneSet collections... GSEA analysis... Warning message in .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize, : “We do not recommend using nPerm parameter incurrent and future releases” Warning message in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, : “You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.” Warning message in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : “There are duplicate gene names, fgsea may produce unexpected results.” Warning message in fgseaSimple(...): “There were 196 pathways for which P-values were not calculated properly due to unbalanced gene-level statistic values” leading edge analysis... done...
It may take a long time and use a lot of memory due to permutations.
Let's see the result of the top enriched pathway.
# Code cell n°36
gseaplot2(gsea_go, geneSetID = 1, title = gsea_go$Description[1])
We can also display several patways on the same plot.
# Code cell n°37
gseaplot2(gsea_go, geneSetID = 1:3, title = "top 3 gene sets", pvalue_table = TRUE,
color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot")
We can also look at the usual figures implemented in clusterProfiler, such as dotplots :
# Code cell n°38
options(repr.plot.width = 17, repr.plot.height = 10)
dotplot(gsea_go, showCategory=20,
label_format = function(x) stringr::str_wrap(x, width= 120))+
ggtitle("dotplot for GSEA")
or emmaplots:
# Code cell n°39
options(repr.plot.width = 12, repr.plot.height = 7)
gsea_go <- enrichplot::pairwise_termsim(gsea_go, method = "JC", semData = "org.Hs.eg.db")
emapplot(gsea_go, showCategory = 20)
or the cneplots :
Here again, we provide you the commands and results but it does not run on this clusterProfiler version.
options(repr.plot.width=8, repr.plot.height= 8)
cnetplot(gsea_go, categorySize="pvalue", showCategory = 5)
The size of the plot can impact the figure: see the same figure with higher height!
options(repr.plot.width = 13, repr.plot.height = 13)
cnetplot(gsea_go, categorySize="pvalue", showCategory = 5)
We can also use the ridgeplot format which visualizes expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.
# Code cell n°40
options(repr.plot.width = 30, repr.plot.height = 7)
ridgeplot(gsea_go, label_format = function(x) stringr::str_wrap(x, width= 120))
Picking joint bandwidth of 0.225
The x axis is the log2(FC) distribution of genes in pathways. Here all enriched pathways are downregulated!
The GSEA Java application is a stand-alone software to run GSEA on your own computer.
Two GSEA methods are implemented in the Java application:
Whatever the method used, GSEA can only compare two conditions.
GSEA requires two input files:
The format for the different inputs is described here
There are 4 acceptable formats for expression input files: .gct, .res, .pcl and .txt. Here we will prepare files in .txt format with R. It requires to fill a first column NAMES with the names of the genes/probes, followed by a column DESCRIPTION that must be with NA values (a bug in GSEA requires this column although it is not informative; they will try to fix this issue in a future release), then a column for each sample.
We use the norm.quant object.
# Code cell n°41
gsea_T1D_norm.quant <- data.frame("NAMES" = row.names(norm.quant), "DESCRIPTION" = NA, as.data.frame(norm.quant))
names(gsea_T1D_norm.quant)[-(1:2)] <- samples_info$array.labels
str(gsea_T1D_norm.quant)
write.table(gsea_T1D_norm.quant, file="gsea_T1D_norm.quant.txt", sep="\t", col.names = TRUE, quote = FALSE, row.names = FALSE)
'data.frame': 47323 obs. of 266 variables: $ NAMES : chr "6450255" "2570615" "6370619" "2600039" ... $ DESCRIPTION : logi NA NA NA NA NA NA ... $ 5753669129_B: num 7.35 7.09 6.3 6.57 6.49 ... $ 5753669129_C: num 7.06 6.91 6.75 6.61 6.49 ... $ 5753669129_E: num 7.05 6.87 6.55 6.57 6.77 ... $ 5753669129_K: num 7.4 6.79 6.57 6.39 6.51 ... $ 5753669129_A: num 7.22 6.64 6.63 6.44 6.94 ... $ 5753669129_F: num 6.87 6.77 6.47 6.78 6.43 ... $ 5753669095_B: num 7.02 6.65 6.4 6.47 6.61 ... $ 5753669095_F: num 6.99 6.84 6.32 6.54 6.62 ... $ 5753669095_G: num 7.29 6.82 6.46 6.49 6.43 ... $ 5753669095_K: num 7.08 6.68 6.43 6.62 6.45 ... $ 5753669095_D: num 7.2 7 6.74 6.52 6.4 ... $ 5753669095_E: num 7.34 6.9 6.62 6.73 6.83 ... $ 5753669095_A: num 6.9 6.86 6.59 6.62 6.53 ... $ 5753669095_H: num 7.29 6.56 6.44 6.6 6.38 ... $ 5753669095_C: num 6.89 6.81 6.55 6.73 6.69 ... $ 5753669095_J: num 7 6.95 6.58 6.49 6.76 ... $ 5753669095_I: num 7.02 6.81 6.54 6.55 6.51 ... $ 5753669095_L: num 7.01 6.87 6.63 6.67 6.38 ... $ 5753669094_B: num 6.87 6.8 6.43 6.63 6.44 ... $ 5753669094_D: num 6.86 6.68 6.63 6.69 6.59 ... $ 5753669094_E: num 7.26 6.46 6.69 6.69 6.5 ... $ 5753669094_I: num 7.57 6.81 6.4 6.64 6.54 ... $ 5753669094_A: num 6.88 6.88 6.44 6.74 6.35 ... $ 5753669094_F: num 6.85 6.63 6.44 6.71 6.69 ... $ 5753669102_D: num 6.7 6.89 6.61 6.68 6.38 ... $ 5753669102_B: num 7.12 6.63 6.63 6.22 6.58 ... $ 5753669102_K: num 7.01 6.81 6.51 6.34 6.76 ... $ 5753669102_A: num 6.62 6.94 6.66 6.56 6.41 ... $ 5753669102_J: num 7.15 6.99 6.34 6.67 6.59 ... $ 5753669094_J: num 6.93 6.7 6.53 6.52 6.76 ... $ 5753669094_K: num 6.88 6.57 6.52 6.69 6.35 ... $ 5753669094_G: num 6.89 6.96 6.42 6.61 6.42 ... $ 5753669094_L: num 7.04 6.67 6.85 6.62 6.7 ... $ 5753669094_C: num 6.85 6.85 6.68 6.58 6.47 ... $ 5753669094_H: num 6.68 6.59 6.47 6.5 6.53 ... $ 5753669028_G: num 7.13 6.79 6.55 6.57 6.43 ... $ 5753669028_H: num 7.07 6.9 6.75 6.35 6.43 ... $ 5753669028_B: num 7.46 7.14 6.82 6.37 6.55 ... $ 5753669028_J: num 7.54 6.64 6.62 6.48 6.53 ... $ 5753669028_C: num 7.62 6.83 6.69 6.55 6.78 ... $ 5753669028_D: num 7.31 6.92 6.6 6.34 6.54 ... $ 5753669028_A: num 6.93 6.91 6.54 6.47 6.51 ... $ 5753669028_K: num 7.22 6.72 6.5 6.49 6.72 ... $ 5753669028_E: num 7.23 6.75 6.56 6.33 6.64 ... $ 5753669028_I: num 7.5 6.67 6.67 6.46 6.17 ... $ 5753669028_F: num 7.1 6.65 6.63 6.47 6.57 ... $ 5753669028_L: num 7.19 6.74 6.68 6.47 6.6 ... $ 5753669136_D: num 7.17 6.74 6.38 6.54 6.16 ... $ 5753669136_F: num 7.11 6.87 6.53 6.52 6.52 ... $ 5753669136_I: num 7.27 6.83 6.6 6.72 6.47 ... $ 5753669136_J: num 7.57 6.96 7.16 6.57 6.39 ... $ 5753669136_B: num 7.19 6.97 6.53 6.63 6.62 ... $ 5753669136_K: num 6.97 6.78 6.52 6.41 6.31 ... $ 5753669111_I: num 7.01 7.03 6.62 6.51 6.44 ... $ 5753669111_F: num 7.52 6.72 6.47 6.55 6.33 ... $ 5753669111_L: num 7.25 6.97 6.48 6.24 6.18 ... $ 5753669111_J: num 7.13 6.61 6.67 6.82 6.46 ... $ 5753669111_K: num 6.72 7.11 6.48 6.49 6.69 ... $ 5753669129_G: num 7.78 6.79 6.44 6.55 6.57 ... $ 5753669129_H: num 7.41 6.7 6.55 6.85 6.46 ... $ 5753669129_D: num 7.27 7.16 6.91 6.57 6.67 ... $ 5753669129_J: num 7.53 6.94 6.57 6.7 6.36 ... $ 5753669129_I: num 7.67 6.59 6.68 6.61 6.77 ... $ 5753669129_L: num 7.23 6.58 6.51 6.45 6.68 ... $ 5753669096_B: num 7.42 6.71 6.47 6.37 6.47 ... $ 5753669096_C: num 7.23 6.84 6.43 6.76 6.41 ... $ 5753669096_H: num 7.33 6.29 6.68 6.57 6.69 ... $ 5753669096_I: num 7.44 6.77 6.76 6.75 6.26 ... $ 5753669096_A: num 7.26 6.61 6.67 6.54 6.29 ... $ 5753669096_J: num 7.23 6.71 6.5 6.57 6.67 ... $ 5753669138_A: num 7.37 6.8 6.68 6.64 6.54 ... $ 5753669138_H: num 7.31 6.89 6.77 6.64 6.48 ... $ 5753669138_J: num 7.41 6.86 6.71 6.43 6.33 ... $ 5753669138_L: num 7.63 6.98 6.48 6.68 6.35 ... $ 5753669138_C: num 7.13 6.85 6.54 6.75 6.42 ... $ 5753669096_E: num 6.87 6.89 6.43 6.62 6.36 ... $ 5753669096_L: num 7.17 6.61 6.59 6.47 6.59 ... $ 5753669096_F: num 6.94 6.75 6.56 6.61 6.3 ... $ 5753669096_K: num 7.19 6.61 6.46 6.43 6.59 ... $ 5753669096_D: num 7.34 6.63 6.4 6.63 6.64 ... $ 5753669096_G: num 7.14 6.55 6.64 6.48 6.43 ... $ 5753669109_I: num 7.19 6.76 6.36 6.47 6.6 ... $ 5753669109_K: num 7.24 6.91 6.64 6.5 6.46 ... $ 5753669109_A: num 7.17 6.67 6.62 6.53 6.16 ... $ 5753669109_F: num 6.96 6.8 6.72 6.49 6.49 ... $ 5753669109_B: num 7.23 6.72 6.57 6.45 6.66 ... $ 5753669109_L: num 7.34 6.7 6.33 6.35 6.73 ... $ 5753669138_D: num 7.16 6.82 6.53 6.44 6.46 ... $ 5753669138_B: num 7.34 6.75 6.61 6.57 6.54 ... $ 5753669138_G: num 7.57 7.09 6.56 6.51 6.53 ... $ 5753669138_F: num 7.34 7 6.65 6.51 6.49 ... $ 5753669138_I: num 7.39 6.8 6.41 6.77 6.48 ... $ 5753669136_A: num 7.14 6.71 6.63 6.68 6.47 ... $ 5753669136_E: num 7 6.72 6.55 6.57 6.63 ... $ 5753669136_G: num 7.14 6.77 6.67 6.58 6.33 ... $ 5753669136_H: num 7.53 6.62 6.62 6.83 6.49 ... $ 5753669136_C: num 7.1 6.88 6.61 6.5 6.42 ... [list output truncated]
The format is RNK: Ranked list file format (*.rnk)
Here are the command for the 5th contrast as an example. You can do the same for other constrats.
# Code cell n°42
gsea_T1D_PatVsCont <- data.frame(limma.full.outs[[6]]$ProbeID, limma.full.outs[[6]]$B)
write.table(gsea_T1D_PatVsCont, file="gsea_T1D_PatVsCont.rnk", sep="\t", col.names = FALSE, quote = FALSE, row.names = FALSE)
The format .cls format is the one for phenotypes.
On the second line, you put names for each class as they will appear in analysis report. The line should begin with a hashtag sign (#) followed by a space.
The third line contains a class label for each sample. The class label can be the class name, a number, or a text string. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on. The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.
Here, we will prepare two phenotype files.
# Code cell n°43
temp <- samples_info$Status
temp <- ifelse(temp == 2, "PAT", "CONT")
gsea_pheno_PatVsCont <- file("gsea_pheno_PatVsCont.cls")
my_text <- paste("264 2 1\n#PAT CONT\n", paste(temp, collapse=" "))
writeLines(my_text, gsea_pheno_PatVsCont)
close(gsea_pheno_PatVsCont)
rm(temp)
You should see the gsea_pheno_PatVsCont file in the left column. Open it to see its structure.
# Code cell n°44
str(samples_info)
'data.frame': 264 obs. of 8 variables: $ array.labels: chr "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ... $ PedID : chr "1" "1" "1" "1" ... $ ID : chr "1" "1" "1" "1" ... $ Status : chr "2" "2" "2" "2" ... $ Stim : chr "0" "0" "6" "6" ... $ Full : chr "1_1_2_0" "1_1_2_0" "1_1_2_6" "1_1_2_6" ... $ Sex : chr "1" "1" "1" "1" ... $ Age : int 10 10 10 10 10 10 18 18 18 18 ...
# Code cell n°45
temp <- samples_info$Status
temp <- ifelse(temp == 2, "PAT", "CONT")
temp <- paste(temp, samples_info$Stim, sep = "")
temp
# Code cell n°46
gsea_pheno_groups <- file("gsea_pheno_groups.cls")
my_text <- paste("264 6 1\n#PAT0 PAT6 PAT24 CONT0 CONT6 CONT24\n", paste(temp, collapse=" "))
writeLines(my_text, gsea_pheno_PatVsCont)
close(gsea_pheno_groups)
You are ready to run your first GSEA analyses!
Download the Java application from here with the appropriate version for your OS. Unzip the downloaded file and launch GSEA by double-clicking on gsea.bat if you are on Windows, by double-clicking on the App or on the gsea.command if you are on macOS, or by typing the command ./gsea.sh if you are on Linux or on macOS.
In File>Preferences, specify the directoty where you want the GSEA results to be saved. By default it is in your home repository /gsea_home/output.
Click on Load Data and upload the different input files. No error must be reported if your formats are correct.
Click on Run GSEA
Select the correct in Required fields the correct filed and parameters:
gsea_T1D_norm.quant.txt for the expression dataset.cls file and the contrast of interestcollapse to deal with several probes for the same genephenotype for permutation if you have several samples per category (7 are recommanded), otherwise select gene_set but the statistical test will not be able to account for putative correlations between genes.Human_ILLUMINA_Array_MSigDB.v2022.1.Hs.chip since we are working with Illumina array data and we kept the probeID identifiers. Should you be working with unique gene symbols, select the HGNC platform.In Basic fields enter a useful name for your analysis and select a method to rank genes. You can also specify the output folder name.
Click on Run at the bottom.
You can then run a Leading edge analysis and vizualize enrichment map. The enrichment outputs can also be imported to Cytoscape using the Cytoscape app EnrichmentMap.
Click on Run GSEAPreranked
Select in Required fileds the correct files and parameters:
gsea_T1D_PatVsCont.rbkcollapse to deal with several probes for the same geneHuman_ILLUMINA_Array_MSigDB.v2022.1.Hs.chip since we are working with Illumina array data and we kept the probeID identifiers. Should you be working with unique gene symbols, select the HGNC platform.Success in green in the "GSEA Reports" frame on the left panel. Just click on Success to look at the results. Follow the guide to understand the outputs. A FDR of 25% is the minimum required to consider a significant enrichment.
Note: GSEA in R:
- an R package exist to run GSEA: GSEABase
- the package clusterProfiler implements GSEA (see part 3.2 of this tutorial) and even a fastest approach (fgsea).
[version 13/11/2022 - last revision:@SCaburet]